Abstract

How can unlabeled video augment visual learning? Existing methods
perform “slow” feature analysis, encouraging the representations of
temporally close frames to exhibit only small differences. While this
standard approach captures the fact that high-level visual signals
change slowly over time, it fails to capture how the visual content
changes. We propose to generalize slow feature analysis to “steady”
feature analysis. The key idea is to impose a prior that higher order
derivatives in the learned feature space must be small. To this end, we
train a convolutional neural network with a regularizer on tuples of
sequential frames from unlabeled video. The regularizer encourages
feature changes over time to be smooth, i.e., similar to the most
recent changes. Using five diverse datasets, including unlabeled
YouTube and KITTI videos, we demonstrate our method’s impact on object,
scene, and action recognition tasks. We further show that our features
learned from unlabeled video can even surpass a standard heavily
supervised pretraining approach.

1. Introduction 

Visual feature learning with deep neural networks has yielded dramatic
gains for image recognition tasks in recent years [23, 38]. While the
main techniques involved in these methods have been known for some
time, a key factor in their recent success is the availability of large
human-labeled image datasets like ImageNet [6]. Deep convolutional
neural networks (CNNs) designed for image recognition typically have
millions of parameters, necessitating notoriously large training
databases to avoid overfitting.

Intuitively, however, visual learning should not be restricted to sets
of category-labeled exemplars. Taking human learning as an obvious
example, children build up visual representations through constant
observation and action in the world. This hints that machine-learned
representations would also be well served to exploit long-term video
observations, even in the absence of deliberate labels. Indeed,
researchers in cognitive science find that temporal coherence plays an
important role in visual learning. For example, altering the natural
temporal contiguity of visual stimuli hinders translation invariance in
the inferior temporal cortex [?], and functions learned to preserve
temporal coherence share behaviors observed in complex cells of the
primary visual cortex [?].

Figure 1: From unlabeled videos, we learn “steady features” that
exhibit consistent feature transitions among sequential frames.

Our goal is to exploit unlabeled video, as might be obtained freely
from the web, to improve visual feature learning. In particular, we are
interested in improving learned image representations for visual
recognition tasks.

Prior work leveraging video for feature learning focuses on the concept
of slow feature analysis (SFA). First formally proposed in [43], SFA
exploits temporal coherence in video as “free” supervision to learn
image representations invariant to small transformations. In
particular, SFA encourages the following property: in a learned feature
space, temporally nearby frames should lie close to each other, i.e.,
for a learned representation $z$ and adjacent video frames $a$ and $b$,
one would like $z(a) \approx z(b)$. The rationale behind SFA rests on a
simple observation: high-level semantic visual concepts associated with
video frames typically change only gradually as a function of the
pixels that compose the frames. Thus, representations useful for
recognizing high-level concepts are also likely to possess this
property of “slowness”. Another way to think about this is that scene
changes between temporally nearby frames are usually small and
represent label-preserving transformations. A slow representation will
tolerate minor geometric or lighting changes, which is essential for
high-level visual recognition tasks. The value of exploiting temporal
coherence for recognition has been repeatedly verified in ongoing
research, including via modern deep convolutional neural network
implementations [31, 3, 13, ?, ?, ?].

However, existing approaches require only that high-level visual
signals change slowly over time. Crucially, they fail to capture how
the visual content changes over time. In contrast, our idea is to
incorporate the steady visual dynamics of the world, learned from
video. For instance, if trained on videos of walking people, slow
feature-based approaches would only require that images of people in
nearby poses be mapped close to one another. In contrast, we aim to
learn a feature space in which frames from a novel video of a walking
person would follow a smooth, predictable trajectory. A learned steady
representation capturing such dynamics would be influenced not only by
object motions, but also by other types of visual transformations. For
instance, it would capture how colors of objects in the sunlight change
over the course of a day, or how the views of a static scene change as
a camera moves around it.

To this end, we propose steady feature analysis, a generalization of
slow feature learning. The key idea is to impose higher order temporal
constraints on the learned visual representation. Beyond encouraging
temporal coherence, i.e., small feature differences between nearby
frame pairs, we would like to encourage consistent feature transitions
across sequential frames. In particular, to preserve second order
slowness, we look at triplets of temporally close frames $a, b, c$, and
encourage the learned representation to have
$z(b) - z(a) \approx z(c) - z(b)$. We develop a regularizer that uses
contrastive loss over tuples of frames to achieve such mappings with
CNNs. Whereas slow feature learning insists that the features not
change too quickly, the proposed steady learning insists that, in
whichever way the features are evolving, they continue to evolve in
that same way in the immediate future. See Figure 1.

We hypothesize that higher-order temporal coherence could provide a
valuable prior for recognition by embedding knowledge of the rich
dynamics of the visual world into the feature space. We empirically
verify this hypothesis using five datasets for a variety of recognition
tasks, including object instance recognition, large-scale scene
recognition, and action recognition from still images. In each case, by
augmenting a small set of labeled exemplars with unlabeled video, the
proposed method generalizes better than both a standard discriminative
CNN and a CNN regularized with existing slow temporal coherence metrics
[?]. Our results reinforce that unsupervised feature learning from
unconstrained video is an exciting direction, with promise to offset
the large labeled data requirements of current state-of-the-art
computer vision approaches by exploiting virtually unlimited unlabeled
video.

2. Related Work 

To build a robust object recognition system, the image representation
must incorporate some degree of invariance to changes in pose,
illumination, and appearance. While invariance can be manually crafted,
such as with spatial pooling operations or gradient descriptors, it may
also be learned. One approach often taken in the convolutional neural
network (CNN) literature is to pad the training data by systematically
perturbing raw images with label-preserving transformations (e.g.,
translation, scaling, intensity scaling, etc.) [37, 39, 5]. A good
representation will ensure that the jittered versions originating from
the same content all map close by in the learned feature space.

In a similar spirit, unlabeled video is an appealing resource for
recovering invariance. The simple fact that things typically cannot
change too quickly from frame to frame makes it possible to harvest
sets of sequential images whose learned representations ought not to
differ substantially. Slow feature analysis (SFA) [?] leverages this
notion to learn features from temporally adjacent video frames.

Recent work uses CNNs to explore the power of learning slow features,
also referred to as “temporally coherent” features [?, 3, 47, 13, 42].
The existing methods either produce a holistic image embedding
[31, 3, ?, ?] or else track local patches to learn a localized
representation [?]. Most methods exploit the learned features for
object recognition [?, 47, ?, ?], while others employ them for
dimensionality reduction [?] or video frame retrieval [13]. In [?], a
standard deep CNN architecture is augmented with a temporal coherence
regularizer, then trained using video of objects on clean backgrounds
rotating on a turntable. The method of [3] builds on this concept,
proposing the use of decorrelation to avoid trivial solutions to the
slow feature criterion, with applications to handwritten digit
classification. The authors of [13] propose injecting an auto-encoder
loss and explore training with unlabeled YouTube video. Building on SFA
subspace ideas [?], researchers have also examined slow features for
action recognition [?], facial expression analysis [?], future
prediction [?], and temporal segmentation [32, 28].

Related to all the above methods, we aim to learn features from
unlabeled video. However, whereas all the past work aims to preserve
feature slowness, our idea is to preserve higher order feature
steadiness. Our learning objective is the first to move beyond adjacent
frame neighborhoods, requiring not only that sequential features change
gradually, but also that they change in a similar manner in adjacent
time intervals.

Another class of methods learns transformations [33, 29, 34]. Whereas
the above feature learning methods (and ours) train with unlabeled
video spanning various unspecified transformations, these methods
instead train with pairs of images for which the transformation is
known and/or consistent. Then, given a novel input, the model can be
used to predict its transformed output. Rather than use learned
transformations for extrapolation like these approaches, our goal is to
exploit transformation patterns in unlabeled video to learn features
that are useful for recognition.

Aside from inferring the transformation that implicitly separates a
pair of training instances, another possibility is to explicitly
predict the transformation parameters. Recent work considers how the
camera’s ego-motion (e.g., as obtained from inertial sensors, GPS) can
be exploited as supervision during CNN training [18, 2]. These methods
also lack the higher-order relationships we propose. Furthermore, they
require training data annotated with camera/ego-pose parameters, which
prevents them from learning with “in the wild” videos (like YouTube)
for which the camera was not instrumented with external sensors to
record motor changes. In contrast, our method is free to exploit
arbitrary unlabeled video data.

Several recent papers [?] have trained unsupervised image
representations targeting specific narrow tasks. [?] learn efficient
generative codes to synthesize images, while [?] learn features to
predict pixel-level optical flow maps for video frames. Contemporary
with an earlier version of our work [?], [?] proposed to learn features
that vary linearly in time, for the specific task of extrapolating
future video frames given a pair of past frames. They report
qualitative results for toy video frame synthesis. While our
formulation also encourages collinearity in the feature space, our aim
is to learn generally useful features from real videos without
supervision, and we report results on natural image scene, object, and
action recognition tasks.

3. Approach 

Given auxiliary raw unlabeled video, we wish to learn an 
embedding amenable to a supervised classification task. We 
pose this as a feature learning problem in a convolutional 
neural network, where the hidden layers of the network are 
tuned not only with the backpropagation gradients from a 
classification loss, but also with gradients computed from 
the unlabeled video that exploit its temporal steadiness. 

3.1. Notation and framework overview 

A supervised training dataset $\mathcal{S} = \{(x_i, y_i)\}$ provides
target class labels $y_i \in \mathcal{Y} = [1, 2, \dots, C]$ for images
$x_i \in \mathcal{X}$ (represented in pixel space). The unsupervised
training dataset $\mathcal{U} = \{x_t\}$ consists of ordered video
frames, where $x_t$ is the video frame at time instant $t$.¹

Importantly, we do not assume that the video $\mathcal{U}$ necessarily
stems from the same categories or even the same domain as images in
$\mathcal{S}$. For example, in results we will demonstrate cases where
$\mathcal{S}$ and $\mathcal{U}$ consist of natural scene images and
autonomous vehicle video, respectively; or Web photos of human actions
and YouTube video spanning dozens of distinct activities. The idea is
that training with diverse unlabeled video should allow the learner to
recover fundamental cues about how objects move, how scenes evolve over
time, how occlusions occur, how illumination varies, etc., independent
of their specific semantic content.

¹ For notational simplicity, we will describe our method assuming that
the unsupervised training data is drawn from a single continuous video,
but it is seamless to train instead with a batch of unlabeled video
clips.

The full image-pixels-to-class label classifier we learn will have the
compositional form $y_{\theta,W} = f_W \circ z_\theta(\cdot)$, where
$z_\theta : \mathcal{X} \to \mathcal{R}^D$ is a $D$-dimensional feature
map operating on images in the pixel space, and
$f_W : \mathcal{R}^D \to \mathcal{Y}$ takes as input the feature map
$z_\theta(x)$, and outputs the class estimate. We learn a linear
classifier $f_W$ represented by a $C \times D$ weight matrix $W$ with
rows $w_1, \dots, w_C$. At test time, a novel image $x$ is classified
as $y_{\theta,W}(x) = \arg\max_i w_i^T z_\theta(x)$.

To learn the classifier $y_{\theta,W}$, we optimize an objective
function of the form:

$$(\theta^*, W^*) = \arg\min_{\theta, W} L_s(\theta, W, \mathcal{S}) + \lambda L_u(\theta, \mathcal{U}), \tag{1}$$

where $L_s(\cdot)$ represents the supervised classification loss,
$L_u(\cdot)$ represents an unsupervised regularization loss term, and
$\lambda$ is the regularization hyperparameter. The parameter vector
$\theta$ is common to both losses because they are both computed on the
learned feature space $z_\theta(\cdot)$. The supervised loss is a
softmax loss:

$$L_s(\theta, W, \mathcal{S}) = -\frac{1}{N_s} \sum_{i=1}^{N_s} \log(\sigma_{y_i}(W z_\theta(x_i))), \tag{2}$$

where $\sigma_{y_i}(\cdot)$ is the softmax probability of the correct
class and $N_s$ is the number of labeled training instances in
$\mathcal{S}$.
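As a rough illustration of the classifier and the softmax loss of Eq (2), the sketch below works on precomputed feature vectors in plain Python. The function names and the 0-based class indices are assumptions made for this example only; in the actual method the features come from the CNN $z_\theta$ and the loss is minimized jointly with the regularizer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(W, z):
    """Scores W z for a C x D weight matrix W (list of rows) and a D-dim feature z."""
    return [sum(w_d * z_d for w_d, z_d in zip(row, z)) for row in W]

def predict(W, z):
    """Class estimate: index i maximizing w_i^T z (0-based in this sketch)."""
    scores = class_scores(W, z)
    return max(range(len(scores)), key=scores.__getitem__)

def supervised_loss(W, features, labels):
    """Mean negative log softmax probability of the correct class, as in Eq (2)."""
    total = 0.0
    for z, y in zip(features, labels):
        total -= math.log(softmax(class_scores(W, z))[y])
    return total / len(features)
```

With an all-zero weight matrix every class is equally likely, so the loss equals $\log C$, which is a convenient sanity check.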

In the following, we first discuss how the unsupervised regularization
loss $L_u(\cdot)$ may be constructed to exploit temporal smoothness in
video (Sec 3.2). Then we generalize this to exploit temporal steadiness
and other higher order coherence (Sec 3.3). Sec 3.4 then shows how a
neural network corresponding to $y_{\theta,W}$ may be trained to
minimize Eq (1) above.

3.2. Review: First-order temporal coherence 

As discussed above, slow feature analysis (SFA) [?] seeks to learn
image features that vary slowly over the frames of a video, with the
aim of learning useful invariances. This idea of exploiting “slowness”
or “temporal coherence” for feature learning has been explored in the
context of neural networks [?]. We briefly review that underlying
objective before introducing the proposed higher order generalization
of temporal coherence.

A temporal neighbor pair dataset $\mathcal{U}_2$ is first constructed
from the unlabeled video $\mathcal{U}$, as follows:

$$\mathcal{U}_2 = \{((j, k), p_{jk}) : x_j, x_k \in \mathcal{U} \text{ and } p_{jk} = \mathbb{1}(0 < j - k < T)\}, \tag{3}$$



where $T$ is the temporal neighborhood size, and the subscript 2
signifies that the set consists of pairs. $\mathcal{U}_2$ indexes image
pairs with neighbor-or-not binary annotations $p_{jk}$, automatically
extracted from the video. We discuss the setting of $T$ in results. In
general, one wants the time window spanned by $T$ to include motions
that are small enough to be label-preserving, so that correct
invariances are learned; in practice this is typically on the order of
a second or less.
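The pair dataset can be sketched in a few lines. The version below is only an illustration under stated assumptions: frames are identified by integer indices, $T$ is expressed in frames rather than seconds, the indicator uses the strict inequalities as printed in Eq (3), and non-neighbor pairs are sampled uniformly at random at three negatives per positive (the 1:3 ratio listed in Table 1). The names `is_neighbor` and `build_pair_dataset` are hypothetical.

```python
import random

def is_neighbor(j, k, T):
    """Indicator p_jk as printed in Eq (3): 1 if 0 < j - k < T, else 0."""
    return 1 if 0 < j - k < T else 0

def build_pair_dataset(num_frames, T, neg_per_pos=3, seed=0):
    """Return a list of ((j, k), p_jk) entries for one video of num_frames frames.

    All neighbor pairs are kept; non-neighbor pairs are subsampled at random
    (an assumption of this sketch) to reach the requested negative ratio.
    """
    rng = random.Random(seed)
    positives, non_neighbors = [], []
    for j in range(num_frames):
        for k in range(j):  # ordered pairs with j > k
            if is_neighbor(j, k, T):
                positives.append(((j, k), 1))
            else:
                non_neighbors.append(((j, k), 0))
    wanted = min(len(non_neighbors), neg_per_pos * len(positives))
    return positives + rng.sample(non_neighbors, wanted)
```

Enumerating every pair is quadratic in the number of frames, so a practical pipeline would sample indices directly rather than materialize the full list.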

With this dataset, the SFA property translates as
$z_\theta(x_j) \approx z_\theta(x_k), \forall p_{jk} = 1$. A simple
formulation of this as an unsupervised regularizing loss would be as
follows:

$$R'_2(\theta, \mathcal{U}) = \sum_{(j,k) \in \mathcal{N}} d(z_\theta(x_j), z_\theta(x_k)), \tag{4}$$

where $d(\cdot,\cdot)$ is a distance measure (e.g., $\ell_1$ in [31]
and $\ell_2$ in [15]), and $\mathcal{N} \subset \mathcal{U}_2$ denotes
the subset of “positive” neighboring frame pairs, i.e., those for which
$p_{jk} = 1$. This loss by itself admits problematic minimizers such as
$z_\theta(x) = 0, \forall x \in \mathcal{X}$, which corresponds to
$R'_2 = 0$. Such solutions may be avoided by a contrastive [?] version
of the loss function that also exploits “negative” (non-neighbor)
pairs:

$$R_2(\theta, \mathcal{U}) = \sum_{(j,k) \in \mathcal{U}_2} D_\delta(z_\theta(x_j), z_\theta(x_k), p_{jk}) = \sum_{(j,k) \in \mathcal{U}_2} p_{jk}\, d(z_{\theta j}, z_{\theta k}) + \bar{p}_{jk} \max(\delta - d(z_{\theta j}, z_{\theta k}), 0), \tag{5}$$

where $z_{\theta j}$ denotes $z_\theta(x_j)$ and $\bar{p} = 1 - p$. As
shown above, the contrastive loss $D_\delta(a, b, p)$ penalizes
distance between $a$ and $b$ when the pair are neighbors ($p = 1$), and
encourages distance between them when they are not ($p = 0$), up to a
margin $\delta$.
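The contrastive term and the resulting first-order regularizer can be written compactly. The sketch below is a minimal plain-Python illustration on precomputed feature vectors; the function names are chosen for this example, and the distance is passed in so that either the $\ell_1$ or the $\ell_2$ variant can be used.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def contrastive(a, b, p, margin, d=l2_distance):
    """D_delta(a, b, p) = p * d(a, b) + (1 - p) * max(margin - d(a, b), 0)."""
    dist = d(a, b)
    return p * dist + (1 - p) * max(margin - dist, 0.0)

def slow_regularizer(features, pairs, margin, d=l2_distance):
    """R_2: sum of contrastive terms over ((j, k), p_jk) entries, as in Eq (5)."""
    return sum(contrastive(features[j], features[k], p, margin, d)
               for (j, k), p in pairs)
```

Note how the negative pairs rule out the collapsed solution: if every feature is zero, each non-neighbor pair still contributes the full margin to the loss.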

3.3. Higher-order temporal coherence 

The slow feature formulation of Eq (5) encourages feature maps that
produce small first-order temporal derivatives in the learned feature
space: $dz_\theta(x_t)/dt \approx 0$. This first-order temporal
coherence is restricted to learning to ignore small jitters in the
visual signal.

Our idea is to model higher order temporal coherence in the unlabeled
video, so that the features can further capture rich structure in how
the visual content changes over time. In the general case, this means
we want a regularizer that encourages higher order derivatives to be
small: $d^n z_\theta(x_t)/dt^n \approx 0, \forall n = 1, 2, \dots, N$.
Accordingly, we need to generalize from pairs of temporally close
frames to tuples of frames.

In this work, we focus specifically on learning steady features (the
second-order case), which can be encoded with triplets of frames, as we
will see next. In a nutshell, whereas slow learning insists that the
features not change too quickly, steady learning insists that feature
changes in the immediate future remain similar to those in the recent
past.

First, we create a triplet dataset $\mathcal{U}_3$ from the unlabeled
video $\mathcal{U}$ as:

$$\mathcal{U}_3 = \{((l, m, n), p_{lmn}) : x_l, x_m, x_n \in \mathcal{U} \text{ and } p_{lmn} = \mathbb{1}(0 < m - l = n - m < T)\}. \tag{6}$$

$\mathcal{U}_3$ indexes image triplets with binary annotations
indicating whether they are in-sequence, evenly spaced frames in the
video, within a temporal neighborhood $T$. In practice, we select
“negatives” ($p_{lmn} = 0$) from triplets where $m - l < T$ but
$n - m > 2T$ to provide a buffer and avoid noisy negatives.
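A minimal sketch of the triplet construction follows, again assuming integer frame indices with $T$ measured in frames and the strict inequalities as printed. The function names are hypothetical, and the exhaustive enumeration of negatives is for illustration only; in practice one would sample them (Table 1 lists a 1:1 positive-to-negative ratio for triplets).

```python
def label_triplet(l, m, n, T):
    """p_lmn as printed in Eq (6): 1 if 0 < m - l = n - m < T, else 0."""
    return 1 if (m - l == n - m and 0 < m - l < T) else 0

def positive_triplets(num_frames, T):
    """All in-sequence, evenly spaced triplets (l, l + s, l + 2s) with 0 < s < T."""
    return [(l, l + s, l + 2 * s)
            for s in range(1, T)
            for l in range(num_frames - 2 * s)]

def negative_triplets(num_frames, T):
    """Buffered negatives: first gap m - l < T, second gap n - m > 2T."""
    return [(l, m, n)
            for l in range(num_frames)
            for m in range(l + 1, min(l + T, num_frames))
            for n in range(m + 2 * T + 1, num_frames)]
```

The buffer between the two thresholds leaves out triplets that are nearly, but not exactly, evenly spaced, which would otherwise be ambiguous negatives.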

We construct our steady feature analysis regularizer using these
triplets, as follows:

$$R_3(\theta, \mathcal{U}) = \sum_{(l,m,n) \in \mathcal{U}_3} D_\delta(z_{\theta l} - z_{\theta m},\; z_{\theta m} - z_{\theta n},\; p_{lmn}), \tag{7}$$

where $z_{\theta l}$ is again shorthand for $z_\theta(x_l)$ and
$D_\delta$ refers to the contrastive loss defined above. For positive
triplets (meaning those occurring in sequence and within a temporal
neighborhood) the above loss penalizes distance between the adjacent
pairwise feature difference vectors. For negative triplets, it
encourages this distance, up to a maximum margin distance $\delta$.
Effectively, $R_3$ encourages the feature representations of positive
triplets to be collinear, i.e.,
$z_\theta(x_l) - z_\theta(x_m) \approx z_\theta(x_m) - z_\theta(x_n)$.
See Figure 1.

Our final optimization objective combines the first and second order
losses (Eq (5) and (7)) into the unsupervised regularization term:

$$L_u(\theta, \mathcal{U}) = R_2(\theta, \mathcal{U}) + \lambda' R_3(\theta, \mathcal{U}), \tag{8}$$

where $\lambda'$ controls the relative impact of the two terms. Recall
this regularizer accompanies the classification loss in the main
objective of Eq (1).
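To make the combined regularizer concrete, the sketch below evaluates the slow term, the steady term, and their weighted sum on precomputed feature vectors. It is a plain-Python illustration with the $\ell_2$ distance; the function names and the separate margins for pairs and triplets are assumptions of this example, not a description of the authors' implementation.

```python
import math

def contrastive(a, b, p, margin):
    """D_delta(a, b, p) with the l2 distance."""
    dist = math.dist(a, b)
    return p * dist + (1 - p) * max(margin - dist, 0.0)

def difference(a, b):
    return [x - y for x, y in zip(a, b)]

def slow_term(features, pairs, margin):
    """R_2 over ((j, k), p_jk) entries."""
    return sum(contrastive(features[j], features[k], p, margin)
               for (j, k), p in pairs)

def steady_term(features, triplets, margin):
    """R_3 over ((l, m, n), p_lmn) entries: contrast adjacent difference vectors."""
    total = 0.0
    for (l, m, n), p in triplets:
        first = difference(features[l], features[m])
        second = difference(features[m], features[n])
        total += contrastive(first, second, p, margin)
    return total

def unsupervised_loss(features, pairs, triplets, margin2, margin3, lam_prime):
    """L_u = R_2 + lambda' * R_3."""
    return (slow_term(features, pairs, margin2)
            + lam_prime * steady_term(features, triplets, margin3))
```

For three features lying evenly spaced on a line, the steady term of a positive triplet is exactly zero, whereas a right-angle turn in feature space is penalized by the length of the change in direction.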

Beyond second-order coherence: The proposed framework generalizes
naturally to the $n$-th order, by defining $R_n$ analogously to Eq (7)
using a contrastive loss over $(n-1)$-th order discrete derivatives,
computed over recursive differences on $n$-tuples. While in principle
higher $n$ would more thoroughly exploit patterns in video, there are
potential practical drawbacks. As $n$ grows, the number of samples
$|\mathcal{U}_n|$ would likely need to also grow to cover the space of
$n$-frame motion patterns, requiring more training time, compute power,
and memory. Besides, discrete $n$-th derivatives computed over large
$n$-frame time windows may grow less reliable, assuming steadiness
degrades over longer temporal windows in typical visual phenomena.
Given these considerations, we focus on second-order steadiness
combined with slowness, and find that slow and steady does indeed win
the race (Sec 4). The empirical question of applying $n > 2$ is left
for future work.
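One possible reading of the recursive-difference construction is sketched below. It is an interpretation made for illustration, not a formulation given in the text: an $n$-tuple of features is differenced $n - 2$ times, leaving two vectors whose distance is the magnitude of the $(n-1)$-th discrete derivative, and the usual contrastive term is applied to that pair. The function names are hypothetical.

```python
import math

def contrastive(a, b, p, margin):
    dist = math.dist(a, b)
    return p * dist + (1 - p) * max(margin - dist, 0.0)

def discrete_derivative(features, order):
    """Apply `order` rounds of adjacent differencing to a list of feature vectors."""
    current = [list(f) for f in features]
    for _ in range(order):
        current = [[b_i - a_i for a_i, b_i in zip(a, b)]
                   for a, b in zip(current, current[1:])]
    return current

def nth_order_term(tuple_features, p, margin):
    """Contrastive term on an n-tuple (n >= 2) under the reading described above."""
    n = len(tuple_features)
    a, b = discrete_derivative(tuple_features, n - 2)
    return contrastive(a, b, p, margin)
```

Under this reading, $n = 2$ recovers a single term of $R_2$ and $n = 3$ recovers a single term of $R_3$, since negating both difference vectors leaves their distance unchanged.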



Figure 2: “Siamese” network configuration (shared weights for the $z_\theta$ layer stacks) with portions corresponding to the 3 terms $L_s$, $R_2$ and
$R_3$ in our objective. $R_2$ and $R_3$ compose the unsupervised loss $L_u$ in Eq (8). $L_s$ is the supervised loss for recognition in static images.


Equivariance-inducing property of $R_3(\theta, \mathcal{U})$: While
first-order coherence encourages invariance, the proposed second-order
coherence may be seen as encouraging the more general property of
equivariance. $z(\cdot)$ is equivariant to an image transformation $g$
if there exists some “simple” function
$f_g : \mathcal{R}^D \to \mathcal{R}^D$ such that
$z(gx) \approx f_g(z(x))$. Equivariance has been found to be useful for
visual representations [?, 36, 26, 18]. To see how feature steadiness
is related to equivariance, consider a video with frames $x_t$,
$1 \le t \le T$. Given a small temporal neighborhood $\Delta t$, frames
$x_{t+\Delta t}$ and $x_t$ must be related by a small transformation
$g$ (small because of the first order temporal coherence assumption),
i.e., $x_{t+\Delta t} = g x_t$. Assuming second order coherence of
video, this transformation $g$ itself remains approximately constant in
a small temporal neighborhood, so that, in particular,
$x_{t+2\Delta t} \approx g x_{t+\Delta t}$.

Now, for equivariant features $z(\cdot)$, by the definition of
equivariance and the observations above,
$z(x_{t+2\Delta t}) \approx f_g(z(x_{t+\Delta t})) \approx f_g \circ f_g(z(x_t))$.
Further, given that $g$ is a small transformation, $f_g$ is
well-approximated in a small neighborhood by its first order Taylor
approximation, so that: (1) $z(x_{t+\Delta t}) \approx z(x_t) + c(t)$,
and (2) $z(x_{t+2\Delta t}) \approx z(x_t) + 2c(t)$. In other words,
under the realistic assumption that natural videos evolve smoothly,
within small temporal neighborhoods, feature equivariance is equivalent
to the second order temporal coherence formulated in Eq (7), with
$l, m, n$ set to $t, t + \Delta t, t + 2\Delta t$ respectively. This
connection between equivariance and the second order temporal coherence
induced by $R_3$ helps motivate why we can expect our feature learning
scheme to benefit recognition.
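A toy numerical check can make the argument tangible. The sketch below assumes, purely for illustration, a two-dimensional feature space in which the equivariance map $f_g$ is a rotation by a small angle; this choice is not taken from the text. It compares the first-order change between consecutive features with the second-order residual that the steady regularizer would penalize.

```python
import math

def rotate(z, phi):
    """Toy equivariance map f_g: rotate a 2-D feature by angle phi (an assumption)."""
    c, s = math.cos(phi), math.sin(phi)
    return [c * z[0] - s * z[1], s * z[0] + c * z[1]]

def norm(v):
    return math.hypot(v[0], v[1])

def residuals(z0, phi):
    """Return (first-order change, second-order residual) for z0, f_g(z0), f_g(f_g(z0))."""
    z1 = rotate(z0, phi)
    z2 = rotate(z1, phi)
    step1 = [b - a for a, b in zip(z0, z1)]
    step2 = [b - a for a, b in zip(z1, z2)]
    first_order = norm(step1)
    second_order = norm([b - a for a, b in zip(step1, step2)])
    return first_order, second_order
```

For a unit-norm feature the first-order change is $2\sin(\phi/2)$ and the second-order residual is its square, so for a small transformation the triplet is far closer to collinear than the individual steps are to zero.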

3.4. Neural networks for the feature maps 

We use a convolutional neural network (CNN) architecture to represent
the feature mapping function $z_\theta(\cdot)$. The parameter vector
$\theta$ represents the CNN’s learned layer weight matrices. See
Sec 4.1 and Supp for architecture choices.

To optimize Eq (1) with the regularizer in Eq (8), we employ standard
mini-batch stochastic gradient descent (as implemented in [20]) in a
“Siamese” setup, with 6 replicas of the stack $z_\theta(\cdot)$, as
shown in Fig 2: 1 stack for $L_s$ (input: supervised training samples
$x_i$), 2 for $R_2$ (input: temporal neighbor pairs $(x_j, x_k)$) and 3
for $R_3$ (input: triplets $(x_l, x_m, x_n)$). The shared layers are
initialized to the same random values and modified by the same
gradients (sum of the gradients of the 3 terms) in each training
iteration, so they remain identical throughout. See Supp for details.
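The weight-sharing update can be sketched schematically. The snippet below is not the authors' training code; it only illustrates, with hypothetical names and a flat list standing in for $\theta$, that a single parameter vector receives the element-wise sum of the gradients of the three loss terms, which is why the replicas never diverge.

```python
# Number of replicas of the feature stack used by each loss term.
REPLICAS = {"L_s": 1, "R_2": 2, "R_3": 3}

def summed_gradient(term_gradients):
    """Element-wise sum of the per-term gradients w.r.t. the shared parameters."""
    return [sum(parts) for parts in zip(*term_gradients)]

def sgd_step(theta, term_gradients, learning_rate):
    """One plain SGD update of the shared parameter vector theta."""
    total = summed_gradient(term_gradients)
    return [t - learning_rate * g for t, g in zip(theta, total)]
```

Each entry of `term_gradients` is assumed to already include any weighting such as $\lambda$ or $\lambda'$, and to aggregate the gradient flowing through all replicas used by that term.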

4. Experiments 

We test our approach using five challenging public datasets for three
tasks (object, scene, and action recognition) spanning 432 categories.
We also analyze its ability to learn higher order temporal coherence
with a sequence completion task.

4.1. Experimental setup 

Our three recognition tasks (specified by the names of the unsupervised
and supervised datasets as $\mathcal{U} \to \mathcal{S}$) are NORB→NORB
object recognition, KITTI→SUN scene recognition and HMDB→PASCAL-10
single-image action recognition. Table 1 (left) summarizes key dataset
statistics.

Supervised datasets $\mathcal{S}$: (1) NORB [25] has 972 images each of
25 toys against clean backgrounds captured over a grid of camera
elevations and azimuths. (2) SUN [44] contains Web images of 397 scene
categories. (3) PASCAL-10 [?] is a still-image human action recognition
dataset with 10 categories. For all three datasets, we use few labeled
training images (see Table 1), since unsupervised regularization
schemes should have most impact when labeled data is scarce [?]. This
is an important scenario, given the “long tail” of categories lacking
ample labeled exemplars.

Unsupervised datasets $\mathcal{U}$: (1) NORB consists of
pose-registered turntable images (not video), but it is straightforward
to generate the pairs and triplets for $\mathcal{U}_2$ and
$\mathcal{U}_3$ assuming smooth motions in the annotated pose space. We
mine these pairs and triplets from among the 648 images per class that
are not used for testing. (2) KITTI [10] has videos captured from a
car-mounted camera in a variety of locations around the city of
Karlsruhe. Scenes are largely static except for traffic, but there is
large and systematic camera motion. (3) HMDB [24] contains 6849 short
Web and movie video clips containing 51 diverse actions. We select 1000
clips at random. While some videos include camera motion (e.g., to
follow an athlete running), most have stationary cameras and small
human pose-change motions. The time window $T$ is a hyperparameter of
both our method and existing SFA methods. We fix $T = 2$ and $T = 0.5$
seconds for KITTI and HMDB, respectively, based on cross-validation for
best performance by the SFA baselines.

Task             Img/frame dims  #Classes  Recog. Task  #Train  #Test  Unsup. Input Type  #Pairs (1:3)  #Triplets (1:1)
NORB→NORB        96x96x1         25        object       150     8100   pose-reg. images   50,000        75,000
KITTI→SUN        32x32x1         397       scene        2382    7940   car-mounted video  100,000       100,000
HMDB→PASCAL-10   32x32x3         10        action       50      2000   web video          100,000       100,000

Datasets→     NORB   KITTI   HMDB
SFA-1 [31]    0.95   31.04   2.70
SFA-2 [15]    0.91   8.39    2.27
SSFA (ours)   0.53   7.79    1.78

Table 1: Left: Statistics for the unsupervised and supervised datasets ($\mathcal{U} \to \mathcal{S}$) used in the recognition tasks (positive to negative ratios for
pairs and triplets indicated in headers). Right: Sequence completion normalized correct candidate rank $\eta$. Lower is better. (See Sec 4.2.)

Baselines: We compare our slow-and-steady feature analysis approach
(SSFA) to four methods, including two key existing methods for learning
from unlabeled video. The three unsupervised baselines are: (1) UNREG:
An unregularized network trained only on the supervised training
samples $\mathcal{S}$. (2) SFA-1: An SFA approach proposed in [31] that
uses $\ell_1$ for $d(\cdot)$ in Eq (5). (3) SFA-2: Another SFA variant
[15] that sets the distance function $d(\cdot)$ to the $\ell_2$
distance in Eq (5). The SFA methods train with the unlabeled pairs,
while SSFA trains with both the pairs and triplets.

These comparisons are most crucial to gauge the impact of the proposed
approach versus the state of the art for feature learning with
unlabeled video. However, we are also interested to what extent
learning from unlabeled video can even start to compete with methods
learned from heavily labeled data (which costs substantial human
effort). Thus, we also compare against a supervised pretraining and
finetuning approach denoted SUP-FT (details in Sec 4.3).

Network architectures: For the NORB→NORB task, we use a fully connected
network architecture: input → 25 hidden units → ReLU nonlinearity →
D=25 features. For the other two tasks, we resize images to 32 x 32 to
allow fast and thorough experimentation with standard CNN architectures
known to work well with tiny images [?], producing D=64-dimensional
features. Recognition tasks on 32 x 32 images are much harder than with
full-sized images, so these are highly challenging tasks. All networks
are optimized with Nesterov-accelerated stochastic gradient descent
until validation classification loss converges or begins to increase.
Optimization hyperparameters are selected greedily through
cross-validation in the following order: base learning rate, $\lambda$
and $\lambda'$ (starting from $\lambda = \lambda' = 0$). The relative
scales of the margin parameters $\delta$ of the contrastive loss
$D_\delta(\cdot)$ in Eq (5) and Eq (7) are validated per dataset. See
Supp for more details on the 32 x 32 architecture, data pre-processing
and optimization.

4.2. Quantifying steadiness 

First we use a sequence completion task to analyze how well the desired
steadiness property is induced in the learned features. We compose a
set of sequential triplets from the pool of test images, formed
similarly to the positives in Eq (6). At test time, given the first two
images of each triplet, the task is to predict what the third looks
like.

We apply our SSFA to infer the missing triplet item as follows. Recall
that our formulation encourages sequential triplets to be collinear in
the feature space. As a result, given $z_\theta(x_1)$ and
$z_\theta(x_2)$, we can extrapolate $z_\theta(x_3)$ as
$\tilde{z}_\theta(x_3) = 2 z_\theta(x_2) - z_\theta(x_1)$. To
backproject to the image space, we identify an image closest to
$\tilde{z}_\theta(x_3)$ in feature space. Specifically, we take a large
pool $\mathcal{C}$ of candidate images, map them all to their features
via $z_\theta$, and rank them in increasing order of distance from
$\tilde{z}_\theta(x_3)$. The rank $r$ of the correct candidate $x_3$ is
now a measure of sequence completion performance. See Supp for details.

Tab 1 (right) reports the mean percentile rank
$\eta = \mathbb{E}[r/|\mathcal{C}|] \times 100$ over all query pairs.
Lower $\eta$ is better. Clearly, our SSFA regularization induces
steadiness in the feature space, reducing $\eta$ nearly by half
compared to baseline regularizers on NORB and by large margins on HMDB
too. Our regularizer $R_3$ is closely matched to this task, so these
gains are expected. Note however that these gains are reported after
training to minimize the joint objective, which includes $L_s$ and
$R_2$, apart from $R_3$, and with regularization weights tuned for
recognition tasks.
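The sequence completion protocol is simple enough to sketch directly. The code below is a minimal plain-Python illustration on precomputed feature vectors, using the Euclidean distance for ranking (the distance used for ranking is an assumption here) and hypothetical function names.

```python
import math

def extrapolate(z1, z2):
    """Linear extrapolation of the third feature: 2 * z2 - z1."""
    return [2 * b - a for a, b in zip(z1, z2)]

def rank_of_correct(z1, z2, candidates, correct_index):
    """1-based rank of the true third frame among candidate features."""
    target = extrapolate(z1, z2)
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(target, candidates[i]))
    return order.index(correct_index) + 1

def mean_percentile_rank(ranks, pool_size):
    """eta: mean of r / |C| over all queries, times 100."""
    return 100.0 * sum(r / pool_size for r in ranks) / len(ranks)
```

A perfectly steady feature space would place the true third frame first for every query, giving the smallest possible value $\eta = 100 / |\mathcal{C}|$.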

Fig 3 shows sequence completion examples from all 3 video datasets.
Particularly impressive results are the third NORB example (where
despite a difficult viewpoint, the sequence is completed correctly by
the top-ranked candidate), and the third HMDB example, where a highly
dynamic baseball pitch sequence is correctly completed by the third
ranked image. The top-ranked candidate for this example illustrates a
common failure mode: the second image of the query pair is itself
picked to complete the sequence. This may reflect the fact that HMDB
sequences in particular exhibit very little motion (camera motions
rare, mostly small object motions). Usually, as in the third KITTI
example, even the top-ranked candidates other than the ground truth
frame are highly plausible completions.

4.3. Recognition results 

Unlabeled video as a prior for supervised recognition: 

Now we report results on the 3 unsupervised-to-supervised 
recognition tasks. Table 2 shows the results. Our SSFA 
method comprehensively outperforms not only the purely 
supervised UNREG baseline, but also the popular SFA-1 and 

Figure 3: Sequence completion examples from all three video datasets. In each instance, a query pair is presented on the left, and the top 
three completion candidates as ranked by our method are presented on the right. Ground truth frames are marked with black highlights. 


Task type →     Objects       Scenes                            Actions
Datasets →      NORB→NORB     KITTI→SUN                         HMDB→PASCAL-10
Methods ↓       [25 cls]      [397 cls]     [397 cls, top-10]   [10 cls]
random          4.00          0.25          2.52                10.00
UNREG           24.64±0.85    0.70±0.12     6.10±0.67           15.34±0.28
SFA-1 [31]      37.57±0.85    1.21±0.14     8.24±0.25           19.26±0.45
SFA-2           39.23±0.94    1.02±0.12     6.78±0.32           19.04±0.24
SSFA (ours)     42.83±0.33    1.65±0.04     9.19±0.10           20.95±0.13


Table 2: Recognition results (mean ± standard error of accuracy 
% over 5 repetitions) (Sec 4.3). Our method outperforms both 
existing slow feature/temporal coherence methods and the 
unregularized baseline substantially, across three distinct 
recognition tasks. 

SFA-2 slow feature learning approaches, beating the best 
baseline for each task by 9%, 36% and 9% (relative gains) 
respectively. The results on KITTI→SUN and HMDB→PASCAL-10 
are particularly impressive because the unsupervised and 
supervised dataset domains are mismatched. All KITTI 
data comes from a single car-mounted road-facing camera 
driving through the streets of one city, whereas SUN 
images are downloaded from the Web, captured by different 
cameras from diverse viewpoints, and cover 397 scene 
categories mostly unrelated to roads. PASCAL-10 images 
are bounding-box-cropped and therefore centered on single 
persons, while HMDB videos, which are mainly clips from 
movies and Web videos, often feature multiple people, are 
not as tightly focused on the person performing the action, 
and are of low quality, sometimes with overlaid text etc. 
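
The per-task gains quoted at the start of this paragraph can be checked directly against Table 2. The short sketch below assumes they are relative improvements over the stronger of the two slow-feature baselines, and uses the [397 cls] column for KITTI→SUN; the dictionary layout and function name are illustrative only.

```python
# Accuracies (%) copied from Table 2.
table2 = {
    "NORB->NORB": {"SFA-1": 37.57, "SFA-2": 39.23, "SSFA": 42.83},
    "KITTI->SUN": {"SFA-1": 1.21, "SFA-2": 1.02, "SSFA": 1.65},
    "HMDB->PASCAL-10": {"SFA-1": 19.26, "SFA-2": 19.04, "SSFA": 20.95},
}

def relative_gain(scores):
    """Relative improvement (%) of SSFA over the best slow-feature baseline."""
    best_baseline = max(v for k, v in scores.items() if k != "SSFA")
    return 100.0 * (scores["SSFA"] - best_baseline) / best_baseline

gains = {task: round(relative_gain(s)) for task, s in table2.items()}
print(gains)  # {'NORB->NORB': 9, 'KITTI->SUN': 36, 'HMDB->PASCAL-10': 9}
```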

Aside from the diversity of tasks (object, scene, and 
action recognition), our unsupervised datasets also exhibit 
diverse types of motion. NORB is generated from planned, 
discrete camera manipulations around a central object of 
interest. The KITTI camera moves through a real, largely 
static landscape in smooth motions on roads at varying 
speeds. HMDB videos, on the other hand, are usually 
captured from stationary cameras with a mix of large and small 
foreground and background object motions. Even the 
dynamic camera videos in HMDB are sometimes captured 
from hand-held devices leading to jerky motions, where our 
temporal steadiness assumptions might be stressed. 


Pairing unsupervised and supervised datasets: Thus 
far, our pairings of unsupervised and supervised datasets 
reflect our attempt to learn from video that a priori seems 
related to the ultimate recognition task, e.g. HMDB human 
action videos are paired with PASCAL-10 Action still 
images. However, as discussed above, the domains are only 
roughly aligned. Curious about the impact of the choice 
of unlabeled video data, we next try swapping out HMDB 
for KITTI in the PASCAL action recognition task. On this 
new KITTI→PASCAL task, we still easily outperform our 
nearest baseline, although our gain drops by ≈ 0.9% 
(SFA-2: 19.06% vs. our SSFA: 20.01%). Despite the fact that the 
human motion dynamics of HMDB ostensibly match the 
action recognition task better than the egomotion dynamics of 
KITTI (where barely any people are visible), we maintain 
our advantage over the purely slow methods. This indicates 
that there is reasonable flexibility in the choice of unlabeled 
videos fed to SSFA. 

Increasing supervised training sets: Thus far, we have 
kept labeled sets small to simulate the “long tail” of 
categories with scarce training samples where priors like ours 
and the baselines’ have most impact. In a preliminary study 
for larger training pools, we now increase SUN training 
set sizes from 6 to 20 samples per class for KITTI→SUN. 
Our method retains a roughly 20% relative gain over existing 
slow methods (SSFA: 3.24% vs SFA-2: 2.65%). This suggests 
our approach is valuable even with larger supervised 
training sets. 

Varying unsupervised training set size: To observe the 
effect of unsupervised training set size, we now restrict 
SSFA to use varying-sized subsets of unlabeled video on the 
HMDB→PASCAL-10 task. Performance scales roughly 
log-linearly with the duration of video observed,² suggesting 
that even larger gains may be achieved simply by training 
SSFA with more freely available unlabeled video. 
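
As a rough check of the log-linear trend, the sketch below fits accuracy against the logarithm of the fraction of unlabeled video used, taking the four data points reported in footnote 2. The least-squares helper is a hypothetical illustration, not part of the original experiments.

```python
import math

# Footnote 2 data: fraction of unlabeled HMDB video used (%) -> accuracy (%).
fractions = [3.0, 12.5, 25.0, 100.0]
accuracy = [18.06, 19.74, 20.36, 20.95]

def fit_log_linear(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * log10(x)."""
    logs = [math.log10(x) for x in xs]
    n = len(logs)
    mean_x = sum(logs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in logs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(logs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_log_linear(fractions, accuracy)
# slope = accuracy points gained per tenfold increase in video duration
```

On these four points the fitted slope is a little under two accuracy points per tenfold increase in video, although with so few measurements the fit is only indicative.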

Purely unsupervised feature learning: We now evaluate 
the usefulness of features trained to optimize the 
unsupervised SSFA loss L_u (Eq 8) alone. Features trained 
on HMDB are evaluated at various stages of training, on 

2 At 3, 12.5, 25, and 100% respectively of the full unlabeled dataset (~32k 
frames), performance is 18.06, 19.74, 20.36, and 20.95% (see Supp) 

Figure 4: Comparison to CIFAR-100 supervised pretraining SUP-FT, 
at various supervised training set sizes. Flat dashed lines 
reflect that our method (and SFA) always use zero additional labels. 

the task of k-nearest neighbor classification on PASCAL-10 
(k = 5, and 100 training images per action). Starting 
at ≈ 17.8% classification accuracy for randomly 
initialized networks, unsupervised SSFA training steadily 
improves the discriminative ability of features to 19.62, 20.32 
and 22.14% after 1, 2 and 3 passes respectively over 
training data (see Supp). This shows that SSFA can train useful 
image representations even without jointly optimizing a 
supervised objective. 
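
The k-nearest neighbor evaluation used here can be sketched as follows. This is a minimal illustration assuming Euclidean distance in feature space and simple majority voting; the distance measure, tie-breaking rule, and function names are assumptions rather than details given in the text.

```python
import math
from collections import Counter

def knn_predict(train_feats, train_labels, query, k=5):
    """Label a query feature by majority vote among its k nearest training features."""
    order = sorted(range(len(train_feats)),
                   key=lambda i: math.dist(train_feats[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    # Ties go to the label seen first, i.e. the one with the nearest neighbor.
    return votes.most_common(1)[0][0]

def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=5):
    """Classification accuracy (%) of k-NN on a held-out test set."""
    correct = sum(knn_predict(train_feats, train_labels, q, k) == y
                  for q, y in zip(test_feats, test_labels))
    return 100.0 * correct / len(test_labels)
```

Because the classifier has no trainable parameters, any change in its accuracy reflects the quality of the learned features alone.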

Comparison to supervised pretraining and finetuning: 

Recently, a two-stage supervised pretraining and finetuning 
strategy (SUP-FT) has emerged as the leading approach 
to solve visual recognition problems with limited training 
data where high-capacity models like deep neural networks 
may not be directly learned [11] [7] [33] [21]. In the first 
stage (“supervised pretraining”), a neural network “NET1” 
is first trained on a related problem for which large training 
datasets are available. In a second stage (“finetuning”), the 
weights from NET1 are used to initialize a second network 
(“NET2”) with similar architecture. NET2 is then trained 
on the target task, using reduced learning rates to minimally 
modify the features learned in NET1. 

In principle, completely unsupervised feature learning 
approaches like ours have important advantages over the 
SUP-FT paradigm. In particular, (1) they can leverage 
essentially infinite unlabeled data without requiring expensive 
human labeling effort, thus potentially allowing the learning 
of higher capacity models, and (2) they do not require the 
existence of large “related” supervised datasets from which 
features may be meaningfully transferred to the target task. 
While the pursuit of these advantages continues to drive 
vigorous research, unsupervised feature learning methods still 
underperform supervised pretraining for image classification 
tasks, where great effort has gone into curating large 
labeled databases, e.g., ImageNet [6], CIFAR [22]. 

As a final experiment, we examine how the proposed 
unsupervised feature learning idea competes with the popular 
supervised pretraining model. To this end, we adopt the 
CIFAR-100 dataset consisting of 100 diverse object 
categories as a basis for supervised pretraining.³ The new 
baseline SUP-FT trains NET1 on CIFAR (see Supp), then 
finetunes NET2 for either PASCAL-10 action or SUN scene 
recognition tasks using the exact same (few) labeled 
instances given to our method. In parallel, our method 
“pretrains” only via the SSFA regularizer learned with unlabeled 
HMDB / KITTI video respectively for the two tasks. Our 
method uses zero labeled CIFAR data. 

3 We choose CIFAR-100 for its compatibility with the 32×32 images 

Fig 4 shows the results. On PASCAL-10 action 
recognition (left), our method significantly outperforms SUP-FT 
pretrained with all 50,000 images of CIFAR-100! Gathering 
image labels from the crowd for large multi-way problems 
can take on average 1 minute per image [35], meaning 
we are getting better results while also saving ~830 hours 
of human effort. On SUN scene recognition (right), SSFA 
outperforms SUP-FT with 5K labels and remains competitive 
even when the supervised method has a 17,500 label 
advantage. However, SUP-FT-50k’s advantage on the SUN 
task is more noticeable; its gain is similar to our gain over 
the best slow-feature method. 

The upward trend in accuracy for SUP-FT with more 
CIFAR-100 labeled data indicates that it successfully 
transfers generic recognition cues to the new tasks. On the other 
hand, the fact that it fares worse on PASCAL actions than 
SUN scenes reinforces that supervised transfer depends on 
having large curated datasets in a strongly related domain. 
In contrast, our approach successfully “transfers” what it 
learns from purely unlabeled video. In short, our method 
can achieve better results with substantially less 
supervision. More generally, we view it as an exciting step towards 
unlabeled video bridging the gap between unsupervised and 
supervised pretraining for visual recognition. 

5. Conclusion 

We formulated an unsupervised feature learning 
approach that exploits higher order temporal coherence in 
unlabeled video, and demonstrated its powerful impact for 
several recognition tasks. Despite over 15 years of research 
surrounding slow feature analysis (SFA), its variants and 
applications, to the best of our knowledge, we are the first 
to identify that SFA is only the first order approximation 
of a more general temporal coherence idea. This basic 
observation leads to our intuitive approach that can be easily 
plugged into applications where first order temporal 
coherence has already been found useful [31] [3] [47] [13] [42] [15] 
[46] [45] [32] [28]. To our knowledge, ours are the first 
results where unsupervised learning from video actually 
surpasses the accuracy of today’s favored approach, heavily 
supervised pretraining. 

used throughout our results, which let us leverage standard CNN 
architectures known to work well with tiny images [1]. 

Figure 5: 32×32 CNN architecture used for the KITTI→SUN 
and HMDB→PASCAL-10 tasks 


6. Appendix 

We now provide supplementary details on (1) the CNN 
architecture used in our SUN and PASCAL-10 experiments, (2) the 
sequence completion task used to quantify steadiness, (3) our 
experiments with varying sizes of unsupervised training datasets, 
(4) our experiments with purely unsupervised feature learning, (5) 
pre-processing steps for the datasets used in our experiments, (6) 
optimization-related details, and (7) details of the supervised 
pretraining and finetuning baseline SUP-FT from the paper. We also 
show samples of all the real image datasets used in our 
experiments. 

32×32 images CNN architecture: The 32×32 CNN 
architecture representing z_θ, used for the KITTI→SUN and 
HMDB→PASCAL-10 tasks, is shown in Fig 5. 

Quantifying steadiness - details: As described in the main 
paper (Sec 4.2), the candidate set C for NORB was 
straightforward to construct: the entire NORB test image set was used. 
For the video datasets KITTI and HMDB though, it would have 
been practically difficult to include all image frames in the 
candidate set C. To avoid having to compute features and perform 
nearest neighbor search over too large a number of frames, we 
formed a randomly sub-sampled C instead, as follows. Starting 
from empty C, we added (1) all the unique images among the query 
pairs, (2) their corresponding ground truth completion images, and 
(3) a minimum number N of randomly chosen frames from each 
video represented within C until this point. This ensures that the 
task is non-trivial by adding distractors from the same video as 
the ground truth candidate image, which are likely to have 
similar appearance. We used N=10 for KITTI and N=5 for HMDB 
to keep the total numbers of images manageable. Finally, we 
select from |C| = 8100, 5000 and 5000 candidates respectively for 
NORB, KITTI and HMDB, for each of 20,000, 1,000 and 
1,000 query pairs respectively for the three datasets. 
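
The three-step construction of the sub-sampled candidate set can be sketched as below. This is one possible reading of the procedure, assuming frames are identified by (video id, frame index) and that the N distractors per video are drawn from frames not already in C; the data layout and function name are hypothetical.

```python
import random

def build_candidate_set(queries, video_frames, n_distractors, seed=0):
    """queries: list of (video_id, i1, i2, i3) frame-index tuples, where
    (i1, i2) is the query pair and i3 the ground-truth completion.
    video_frames: dict mapping video_id -> number of frames in that video."""
    rng = random.Random(seed)
    candidates = set()
    # Steps (1) and (2): query-pair frames and their ground-truth completions.
    for vid, i1, i2, i3 in queries:
        candidates.update({(vid, i1), (vid, i2), (vid, i3)})
    # Step (3): random distractors from every video represented so far.
    for vid in sorted({v for v, _ in candidates}):
        pool = [(vid, i) for i in range(video_frames[vid])
                if (vid, i) not in candidates]
        candidates.update(rng.sample(pool, min(n_distractors, len(pool))))
    return candidates
```

Drawing distractors from the same videos as the ground-truth frames keeps the task hard, since those frames tend to look alike.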

Varying unsupervised training set size: To observe 
the effect of unsupervised training set size, we now restrict 
SSFA to use varying-sized subsets of unlabeled video on the 
HMDB→PASCAL-10 task. The full HMDB dataset has 
approximately 1000 videos, for a total of ≈32,000 frames. Performance 
scales roughly log-linearly with the duration of video observed, as 
shown in Fig 6, suggesting that even larger gains may be achieved 
simply by training SSFA with more freely available unlabeled 
video. 



Figure 6: SSFA classification accuracy vs. duration of 
unsupervised video (mean, standard error over 5 runs). 

Figure 7: SSFA k-NN accuracy improvement with SSFA training 
(mean, standard error over 5 runs). Horizontal axis: training minibatch. 

Purely unsupervised feature learning: We evaluate the 
usefulness of features trained to optimize the unsupervised SSFA 
loss L_u (main paper Eq 8) alone. Features trained on HMDB 
are evaluated at various stages of training, on the task of k-nearest 
neighbor classification on PASCAL-10 (k = 5, and 100 training 
images per action). Fig 7 shows the results. Starting at ≈ 17.8% 
classification accuracy for randomly initialized networks, 
unsupervised SSFA training steadily improves the discriminative ability of 
features. This shows that SSFA can train useful image 
representations even without jointly optimizing a supervised objective. 

Dataset pre-processing details: For all tasks, images are 
mean-subtracted and contrast-normalized before passing to the 
neural networks. In addition, for KITTI→SUN, full KITTI frames 
were resized to 32×32 and SUN images were cropped to KITTI 
aspect ratio before resizing to the same dimensions. Grayscale 
images were used in this task. Similarly, for HMDB→PASCAL-10, 
HMDB frames were cropped to centered squares, and 
PASCAL-10 bounding boxes were expanded to the closest square 
before resizing to 32×32. Resizing for KITTI→SUN and 
HMDB→PASCAL-10 was done to allow fast and thorough 
experimentation with standard CNN architectures known to work well 
with tiny images [1]. On the SUN dataset, where we lose 
information through KITTI-aspect-ratio cropping in addition to 
resizing, we verified that our baselines were legitimate by running a 
simple nearest neighbor baseline in the pixel space (standard 
approach for tiny images). This achieved 0.61% accuracy compared 
to UNREG’s 0.70%, given the same training data. 


Optimization details: We initialized according to the scheme 
proposed in (ED, and ran Nesterov accelerated stochastic 
gradient descent using the open source Caffe [20] package. The base 
learning rate and regularization weights (λs) are selected with greedy 
cross-validation.⁴ Specifically, for each task, the optimal base learning 
rate (from 0.1, 0.01, 0.001, 0.0001) was first identified for UNREG. 
Next λ was set through a logarithmic grid search (steps of 10^0.5), 
with λ' set to 0, i.e., this parameter was optimized for SFA-2. The 
margin parameter δ of the contrastive loss in R_2(·) was set to 1.0 
for all methods; this affects the objective function only up to a 
feature scaling operation, and so may be set to any positive value. 
For SSFA, a similar search was then performed over λ' (logarithmic 
grid search with steps of 10^0.5), and then a small search for 
the contrastive loss margin δ in R_3(·) (over 0, 0.1 and 1). Setting 
the margin to δ = 0 in a contrastive loss reduces it to the simple 
distance loss over positive samples. 
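
The last remark about the margin can be seen in a small sketch of a hinge-style contrastive loss on a single pair. The form below (Euclidean distance for positive pairs, a hinge on the margin for negative pairs) is an assumed generic formulation for illustration; it may differ in detail from the exact definitions of R_2 and R_3 in the main paper.

```python
import math

def contrastive_loss(a, b, is_positive, margin=1.0):
    """Hinge-style contrastive loss on one pair of feature vectors."""
    d = math.dist(a, b)
    if is_positive:
        return d                    # pull temporal neighbors together
    return max(margin - d, 0.0)     # push non-neighbors at least `margin` apart

# With margin = 0 the negative term is max(-d, 0) = 0 for every pair,
# so only the distance penalty on positive pairs remains.
```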

On a single Tesla K-40 GPU machine, NORB→NORB training 
tasks took ≈60 minutes, KITTI→SUN tasks took ≈90 minutes, 
and HMDB→PASCAL-10 tasks took ≈60 minutes. SSFA 
training took about 2× the training time and 1.5× the training epochs to 
converge, compared to SFA baselines, because of the more 
complex loss function. 


Supervised pretraining and finetuning - details: For the 
supervised pretraining and finetuning comparison experiments in 
Sec 4.3, we used the same neural network architecture as used for 
our approach and other baselines on the SUN scene and PASCAL-10 
action recognition tasks (architecture shown in Fig 5). A 100-way 
softmax classifier was trained on the 64-dimensional final 
layer features to classify CIFAR-100 classes during pretraining, 
but these classifier weights are ignored for supervised transfer. 
All other weights in the network are used to set the 
corresponding weights on the network to be trained for the target task. For 
SUN (397 classes × 5 images per class), we found it beneficial to 
finetune features by reducing the learning rate for the pretrained 
layers by a factor of 0.1 compared to the full learning rate used to 
train the 397-way classifier on top. For PASCAL-10 (10 classes × 
5 images per class), only the 10-way action classifier was trained 
starting from random weights, while the weights of lower layers 
were frozen to their pretrained values, since finetuning was found 
to adversely impact classification results. 
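
The two finetuning regimes described above amount to different per-layer learning-rate multipliers. The sketch below illustrates this with a plain SGD update (not the Nesterov variant used in the experiments); the layer names, dictionary layout, and function name are hypothetical.

```python
def sgd_step(params, grads, base_lr, lr_mult):
    """One plain SGD update with a per-layer learning-rate multiplier."""
    return {
        name: [w - base_lr * lr_mult[name] * g
               for w, g in zip(weights, grads[name])]
        for name, weights in params.items()
    }

# SUN: pretrained layers learn at 0.1x the rate of the new 397-way classifier.
SUN_LR_MULT = {"pretrained": 0.1, "classifier": 1.0}
# PASCAL-10: pretrained layers are frozen; only the 10-way classifier is trained.
PASCAL_LR_MULT = {"pretrained": 0.0, "classifier": 1.0}
```

A multiplier of zero leaves the pretrained weights untouched, which matches the frozen-layer setting used for PASCAL-10.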


Dataset sample images: Some sample images of KITTI, 
SUN, HMDB-51 and PASCAL-10 are shown at the end of this 
document. 


4 Our validated (λ, λ') values for NORB→NORB, KITTI→SUN, and 
HMDB→PASCAL respectively are (0.1, 0.3), (3, 0.1), and (0.3, 1) 